Learning optical flow from still images
This paper deals with the scarcity of data for training optical flow
networks, highlighting the limitations of existing sources such as labeled
synthetic datasets or unlabeled real videos. Specifically, we introduce a
framework to generate accurate ground-truth optical flow annotations quickly
and in large amounts from any readily available single real picture. Given an
image, we use an off-the-shelf monocular depth estimation network to build a
plausible point cloud for the observed scene. Then, we virtually move the
camera in the reconstructed environment with known motion vectors and rotation
angles, allowing us to synthesize both a novel view and the corresponding
optical flow field connecting each pixel in the input image to the one in the
new frame. When trained with our data, state-of-the-art optical flow networks
achieve superior generalization to unseen real data compared to the same models
trained either on annotated synthetic datasets or unlabeled videos, and better
specialization if combined with synthetic images.
Comment: CVPR 2021. Project page with supplementary material and code:
https://mattpoggi.github.io/projects/cvpr2021aleotti
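The virtual-camera pipeline described above can be sketched in a few lines of NumPy: back-project every pixel to a 3D point cloud using the estimated depth map, apply a known virtual rotation and translation, and re-project into the novel view; the flow field is the resulting pixel displacement. This is a minimal sketch under a pinhole-camera assumption, with illustrative names rather than the paper's actual code.

```python
import numpy as np

def flow_from_depth(depth, K, R, t):
    """Synthesize a dense optical flow field by virtually moving the camera.

    depth : (H, W) estimated depth map (e.g. from a monocular network)
    K     : (3, 3) camera intrinsics
    R, t  : known virtual rotation (3, 3) and translation (3,)
    Returns a (2, H, W) flow field (dx, dy) from the input image to the
    synthesized novel view.
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    # Homogeneous pixel coordinates, shape (3, N)
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    # Back-project to a 3D point cloud: X = depth * K^-1 * pix
    pts = np.linalg.inv(K) @ pix * depth.ravel()
    # Apply the known virtual camera motion
    pts_new = R @ pts + t.reshape(3, 1)
    # Re-project into the novel view and take the pixel displacement
    proj = K @ pts_new
    proj = proj[:2] / proj[2]
    return (proj - pix[:2]).reshape(2, h, w)
```

With an identity motion the flow is zero everywhere; a pure sideways translation of a fronto-parallel scene yields a constant horizontal flow, which makes the sketch easy to sanity-check.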
Learning End-To-End Scene Flow by Distilling Single Tasks Knowledge
Scene flow is a challenging task aimed at jointly estimating the 3D structure
and motion of the sensed environment. Although deep learning solutions achieve
outstanding performance in terms of accuracy, these approaches divide the whole
problem into standalone tasks (stereo and optical flow) addressing them with
independent networks. Such a strategy dramatically increases the complexity of
the training procedure and requires power-hungry GPUs to infer scene flow
barely at 1 FPS. Conversely, we propose DWARF, a novel and lightweight
architecture able to infer full scene flow by jointly reasoning about depth and
optical flow, easily and elegantly trainable end-to-end from scratch. Moreover,
since ground-truth annotations for full scene flow are scarce, we propose to
leverage the knowledge learned by networks specialized in stereo or flow,
for which much more data are available, to distill proxy annotations.
Exhaustive experiments show that i) DWARF runs at about 10 FPS on a single
high-end GPU and at about 1 FPS on an NVIDIA Jetson TX2 embedded board at KITTI
resolution, with a moderate drop in accuracy compared to 10x deeper models, and
ii) learning from many distilled samples is more effective than learning from
the few annotated ones available. Code available at:
https://github.com/FilippoAleotti/Dwarf-Tensorflow
Comment: Accepted to AAAI 2020. Project page:
https://vision.disi.unibo.it/~faleotti/dwarf.htm
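The distillation idea above, supervising a student network with proxy labels produced by specialized teacher networks, can be sketched as a simple masked regression loss. All names (including the confidence mask and threshold) are illustrative assumptions, not the paper's actual training code.

```python
import numpy as np

def proxy_distillation_loss(student_pred, teacher_pred, confidence, thresh=0.5):
    """Proxy-supervision loss in the spirit of the distillation scheme above.

    student_pred : student's prediction (e.g. disparity or flow), any shape
    teacher_pred : proxy annotation distilled from a specialized teacher
    confidence   : hypothetical per-pixel reliability of the proxy labels
    Pixels whose confidence falls below `thresh` are excluded, so unreliable
    proxy labels do not corrupt the student.
    """
    mask = confidence >= thresh
    if not mask.any():
        return 0.0
    # L1 regression toward the trusted subset of proxy labels
    return float(np.mean(np.abs(student_pred[mask] - teacher_pred[mask])))
```

Masking by confidence is the usual design choice when distilling from imperfect teachers: it trades label density for label quality, which the paper's experiments suggest is a favorable trade compared to the few real annotations available.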
Monitoring social distancing with single image depth estimation
The recent pandemic emergency raised many challenges regarding the
countermeasures aimed at containing the spread of the virus, and constraining
the minimum distance between people proved to be one of the most effective
strategies. Thus, the implementation of autonomous systems capable of
monitoring the so-called social distance has gained much interest. In this paper,
we aim to address this task leveraging a single RGB frame without additional
depth sensors. In contrast to existing single-image alternatives failing when
ground localization is not available, we rely on single image depth estimation
to perceive the 3D structure of the observed scene and estimate the distance
between people. During the setup phase, a straightforward calibration
procedure, leveraging a scale-aware SLAM algorithm available even on consumer
smartphones, allows us to address the scale ambiguity affecting single image
depth estimation. We validate our approach through indoor and outdoor images
employing a calibrated LiDAR + RGB camera setup. Experimental results highlight
that our proposal enables sufficiently reliable estimation of the
inter-personal distance to monitor social distancing effectively. This
confirms that, despite its intrinsic ambiguity, single image depth estimation,
if appropriately driven, can be a viable alternative to other depth perception
techniques that are more expensive and not always feasible in practical
applications.
Our evaluation also highlights that our framework can run reasonably fast and
comparably to competitors, even on pure CPU systems. Moreover, its practical
deployment on low-power systems is around the corner.
Comment: Accepted for publication in IEEE Transactions on Emerging Topics in
Computational Intelligence (TETCI)
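The geometric core of the approach above, measuring inter-personal distance from a single depth map, can be sketched as follows: back-project each detected person's image location through the camera intrinsics, rescale by the factor recovered during the SLAM-based calibration, and take the Euclidean distance between the resulting 3D points. All names, including the `scale` parameter standing in for the calibration step, are illustrative assumptions.

```python
import numpy as np

def interpersonal_distance(p1, p2, depth, K, scale=1.0):
    """Estimate the metric distance between two detected people.

    p1, p2 : (u, v) pixel coordinates of the two people (e.g. keypoints)
    depth  : (H, W) single-image depth estimate (up to scale)
    K      : (3, 3) camera intrinsics
    scale  : factor resolving the scale ambiguity, recovered once during
             the SLAM-based setup phase described above
    """
    K_inv = np.linalg.inv(K)

    def backproject(p):
        u, v = p
        z = depth[v, u] * scale  # metric depth after scale calibration
        # 3D point: z * K^-1 * [u, v, 1]
        return z * (K_inv @ np.array([u, v, 1.0]))

    return float(np.linalg.norm(backproject(p1) - backproject(p2)))
```

The comparison against a calibrated LiDAR + RGB setup in the paper essentially checks how well such back-projected distances survive the errors of the underlying monocular depth network.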
Real-time single image depth perception in the wild with handheld devices
Depth perception is paramount to tackle real-world problems, ranging from
autonomous driving to consumer applications. For the latter, depth estimation
from a single image represents the most versatile solution, since a standard
camera is available on almost any handheld device. Nonetheless, two main issues
limit its practical deployment: i) the low reliability when deployed
in-the-wild and ii) the demanding resource requirements to achieve real-time
performance, often not compatible with such devices. Therefore, in this paper,
we deeply investigate these issues showing how they are both addressable
adopting appropriate network design and training strategies -- also outlining
how to map the resulting networks on handheld devices to achieve real-time
performance. Our thorough evaluation highlights the ability of such fast
networks to generalize well to new environments, a crucial feature required to
tackle the extremely varied contexts faced in real applications. Indeed, to
further support this evidence, we report experimental results concerning
real-time depth-aware augmented reality and image blurring with smartphones
in-the-wild.
Comment: 11 pages, 9 figures
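One of the consumer applications mentioned above, depth-aware image blurring, reduces to a simple idea: keep pixels near a chosen focus depth sharp and replace the rest with a blurred version of the image. The sketch below is a toy box-blur variant with illustrative names and parameters, not the paper's mobile implementation.

```python
import numpy as np

def depth_aware_blur(image, depth, focus_depth, tolerance=0.5, kernel=5):
    """Toy depth-aware blur: out-of-focus pixels get a box-blurred value.

    image       : (H, W) grayscale image
    depth       : (H, W) per-pixel depth estimate
    focus_depth : depth of the plane to keep sharp
    tolerance   : half-width of the in-focus depth band
    kernel      : side of the box-blur window
    """
    pad = kernel // 2
    padded = np.pad(image.astype(np.float64), pad, mode="edge")
    h, w = image.shape
    # Accumulate the box-blur by summing shifted copies, then normalize
    blurred = np.zeros((h, w), dtype=np.float64)
    for dy in range(kernel):
        for dx in range(kernel):
            blurred += padded[dy:dy + h, dx:dx + w]
    blurred /= kernel * kernel
    # Blend: blur only where depth is far from the focus plane
    out_of_focus = np.abs(depth - focus_depth) > tolerance
    return np.where(out_of_focus, blurred, image.astype(np.float64))
```

A real deployment would use a smoother blur falloff with depth, but the hard threshold keeps the sketch short and makes the role of the depth map explicit.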
On the confidence of stereo matching in a deep-learning era: a quantitative evaluation
Stereo matching is one of the most popular techniques to estimate dense depth
maps by finding the disparity between matching pixels on two synchronized and
rectified images. Alongside the development of more accurate algorithms,
the research community focused on finding good strategies to estimate the
reliability, i.e. the confidence, of estimated disparity maps. This information
proves to be a powerful cue to naively find wrong matches as well as to improve
the overall effectiveness of a variety of stereo algorithms according to
different strategies. In this paper, we review more than ten years of
developments in the field of confidence estimation for stereo matching. We
extensively discuss and evaluate existing confidence measures and their
variants, from hand-crafted ones to the most recent, state-of-the-art learning
based methods. We study the different behaviors of each measure when applied to
a pool of different stereo algorithms and, for the first time in the literature,
when paired with a state-of-the-art deep stereo network. Our experiments,
carried out on five different standard datasets, provide a comprehensive
overview of the field, highlighting in particular both strengths and
limitations of learning-based strategies.
Comment: TPAMI final version
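Among the hand-crafted confidence measures surveyed in this line of work, one classic example is the Peak Ratio (PKR): the ratio between the second-lowest and the lowest matching cost at each pixel, where a large value indicates a distinct minimum and hence a more reliable disparity. The sketch below assumes a simple (H, W, D) cost volume; names are illustrative.

```python
import numpy as np

def peak_ratio_confidence(cost_volume, eps=1e-6):
    """Peak Ratio (PKR) confidence over a stereo matching cost volume.

    cost_volume : (H, W, D) matching costs, one per candidate disparity
    Returns an (H, W) map where larger values mean a more distinctive
    cost minimum, i.e. a more trustworthy disparity estimate.
    """
    sorted_costs = np.sort(cost_volume, axis=-1)
    c1 = sorted_costs[..., 0]  # best (lowest) matching cost
    c2 = sorted_costs[..., 1]  # second-best matching cost
    return c2 / (c1 + eps)
```

Learning-based measures reviewed in the paper replace such fixed cost-curve statistics with features learned from data, which is precisely the trade-off (robustness vs. generality) that the quantitative evaluation examines.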